Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Change schema_infer_max_rec config to use Option<usize> rather than usize #13250

Merged
merged 2 commits into from
Nov 6, 2024

Conversation

alihan-synnada
Copy link
Contributor

@alihan-synnada alihan-synnada commented Nov 4, 2024

Which issue does this PR close?

No related issue

Rationale for this change

I came across the following obstacles trying to write some custom schema inference code:

  1. There are cases where we need to know if the user explicitly set the schema_infer_max_rec config.
  2. The stream passed into read_to_delimited_chunks_from_stream does not necessarily have to be BoxStream<'static>

What changes are included in this PR?

  1. Make CsvOptions::schema_infer_max_rec and JsonOptions::schema_infer_max_rec Options
  2. Add lifetime parameter to functions in CSV and compression that makes the compiler complain

Are these changes tested?

Tested with existing code and tests.

Are there any user-facing changes?

  1. CsvOptions::schema_infer_max_rec and JsonOptions::schema_infer_max_rec are now Option<usize>s as opposed to usize (with_schema_infer_max_rec signature is unchanged)
  2. read_to_delimited_chunks, read_to_delimited_chunks_from_stream, convert_stream, convert_to_compress_stream now take lifetime parameters

I believe (1) is a breaking change but I'm not sure about (2)

@github-actions github-actions bot added core Core DataFusion crate common Related to common crate proto Related to proto crate labels Nov 4, 2024
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @alihan-synnada

There are cases where we need to know if the user explicitly set the schema_infer_max_rec config.

It seems like checking for 0 would be sufficient to see if they set the option, as 0 is not a valid value for schema inference (you can't infer a schema with no records) 🤔

Using an Option seems fine to me as it is more consistent, but I do think we should keep the CSV and JSON options consistent

@@ -1773,7 +1773,7 @@ config_namespace! {
/// Options controlling JSON format
pub struct JsonOptions {
pub compression: CompressionTypeVariant, default = CompressionTypeVariant::UNCOMPRESSED
pub schema_infer_max_rec: usize, default = 100
pub schema_infer_max_rec: Option<usize>, default = None
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we make this change to be Option then it is inconsistent with the JSON options:

pub compression: CompressionTypeVariant, default = CompressionTypeVariant::UNCOMPRESSED
pub schema_infer_max_rec: usize, default = 100

I think it would be best to keep the two options consistent

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did change both CsvOptions::schema_infer_max_rec and JsonOptions::schema_infer_max_rec so they are consistent. I think the PR body was confusing because of the 2nd change (lifetimes) mentioning CSV. I'll rephrase the PR

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah I see -- I was clearly confused

@alamb alamb added the api change Changes the API exposed to users of the crate label Nov 4, 2024
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @alihan-synnada -- this makes sense to me and is consistent with other options

@@ -1773,7 +1773,7 @@ config_namespace! {
/// Options controlling JSON format
pub struct JsonOptions {
pub compression: CompressionTypeVariant, default = CompressionTypeVariant::UNCOMPRESSED
pub schema_infer_max_rec: usize, default = 100
pub schema_infer_max_rec: Option<usize>, default = None
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah I see -- I was clearly confused

@alamb alamb changed the title Revise schema inference to use Option config and lifetime parameters Change schema_infer_max_rec config to use Option<usize> rather than usize Nov 5, 2024
Copy link
Contributor

@berkaysynnada berkaysynnada left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM also, thank you for the review @alamb

@berkaysynnada berkaysynnada merged commit 39aa15e into apache:main Nov 6, 2024
26 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
api change Changes the API exposed to users of the crate common Related to common crate core Core DataFusion crate proto Related to proto crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants